In [ ]:
%%HTML
<style>
.container { width:100% } 
</style>

Polynomial Logistic Regression


In [ ]:
import numpy  as np
import pandas as pd

The data we want to investigate is stored in the file 'fake-data.csv'. Since I found this data somewhere and cannot tell whether it is real or fake, I won't discuss its attributes. What matters is that it poses a classification problem that cannot be solved with ordinary logistic regression. We will introduce polynomial logistic regression to solve this problem.


In [ ]:
DF = pd.read_csv('fake-data.csv')
DF.head()

We extract the features from the data frame and convert them into a NumPy feature matrix.


In [ ]:
X = np.array(DF[['x','y']])

We extract the target column and convert it into a NumPy array.


In [ ]:
Y = np.array(DF['class'])

In order to plot the instances according to their class, we divide the feature matrix $X$ into two parts. $\texttt{X_pass}$ contains those examples that have class $1$, while $\texttt{X_fail}$ contains those examples that have class $0$.


In [ ]:
X_pass = X[Y == 1.0]
X_fail = X[Y == 0.0]

Let us plot the data.


In [ ]:
import matplotlib.pyplot as plt
import seaborn           as sns

In [ ]:
plt.figure(figsize=(15, 10))
sns.set(style='darkgrid')
plt.title('A Classification Problem')
plt.axvline(x=0.0, c='k')
plt.axhline(y=0.0, c='k')
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.xticks(np.arange(-0.9, 1.1, step=0.1))
plt.yticks(np.arange(-0.8, 1.2, step=0.1))
plt.scatter(X_pass[:,0], X_pass[:,1], color='b') 
plt.scatter(X_fail[:,0], X_fail[:,1], color='r')

We want to split the data into a training set and a test set. The training set is used to compute the parameters of our model, while the test set is only used to check its accuracy. SciKit-Learn provides the function train_test_split in the module sklearn.model_selection, which randomly splits data into a training set and a test set.


In [ ]:
from sklearn.model_selection import train_test_split

We will split the data at a ratio of $4:1$, i.e. $80\%$ of the data will be used for training, while the remaining $20\%$ is used to test the accuracy.


In [ ]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

In order to build a logistic regression classifier, we import the module linear_model from SciKit-Learn.


In [ ]:
import sklearn.linear_model as lm

The function $\texttt{logistic_regression}(\texttt{X_train}, \texttt{Y_train}, \texttt{X_test}, \texttt{Y_test})$ takes a feature matrix $\texttt{X_train}$ and a corresponding vector $\texttt{Y_train}$ and computes a logistic regression model $M$ that best fits these data. Then, the accuracy of the model is computed using the test data $\texttt{X_test}$ and $\texttt{Y_test}$. The optional parameter $\texttt{reg}$ is passed on to SciKit-Learn as the inverse regularization strength $C$; its large default value effectively switches regularization off.


In [ ]:
def logistic_regression(X_train, Y_train, X_test, Y_test, reg=10000):
    M = lm.LogisticRegression(C=reg, tol=1e-6, solver='newton-cg')  # C is the inverse regularization strength
    M.fit(X_train, Y_train)                                         # fit the model to the training data
    train_score = M.score(X_train, Y_train)                         # accuracy on the training data
    yPredict    = M.predict(X_test)                                  # predicted classes for the test data
    accuracy    = np.sum(yPredict == Y_test) / len(Y_test)           # accuracy on the test data
    return M, train_score, accuracy

We use this function to build a model for our data. Initially, we will take all the available data to create the model.


In [ ]:
M, score, accuracy = logistic_regression(X, Y, X, Y)
score, accuracy
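
As a point of reference, let us also compute the accuracy of a trivial classifier that always predicts the more frequent class. This small baseline check is an addition that only serves to put the numbers above in perspective:


In [ ]:
# Baseline for comparison: accuracy obtained by always predicting the majority class.
max(np.mean(Y == 1), np.mean(Y == 0))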

Given that there are only two classes, the accuracy of our first model is quite poor.
Let us extract the coefficients so we can plot the decision boundary.


In [ ]:
ϑ0     = M.intercept_[0]
ϑ1, ϑ2 = M.coef_[0]

In [ ]:
plt.figure(figsize=(15, 10))
sns.set(style='darkgrid')
plt.title('A Classification Problem')
plt.axvline(x=0.0, c='k')
plt.axhline(y=0.0, c='k')
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.xticks(np.arange(-0.9, 1.1, step=0.1))
plt.yticks(np.arange(-0.8, 1.2, step=0.1))
plt.scatter(X_pass[:,0], X_pass[:,1], color='b') 
plt.scatter(X_fail[:,0], X_fail[:,1], color='r') 
H = np.arange(-0.8, 1.0, 0.05)
P = -(ϑ0 + ϑ1 * H)/ϑ2
plt.plot(H, P, color='green')

Clearly, pure logistic regression does not work for this example: a linear decision boundary is not able to separate the positive examples from the negative examples. Let us therefore add polynomial features, which enable us to create more complex decision boundaries.
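In polynomial logistic regression the model still feeds a linear combination of the features into the sigmoid function; only the feature vector is enriched with polynomial terms. With the quadratic features constructed below, the probability of class $1$ is estimated as $$S\bigl(\vartheta_0 + \vartheta_1 \cdot x + \vartheta_2 \cdot y + \vartheta_3 \cdot x^2 + \vartheta_4 \cdot y^2 + \vartheta_5 \cdot x \cdot y\bigr), \quad \textrm{where } S(t) = \frac{1}{1 + \mathrm{e}^{-t}}.$$ The decision boundary is the curve on which this probability equals $\frac{1}{2}$, i.e. on which the polynomial inside $S$ vanishes.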

The function $\texttt{extend}(X)$ takes a feature matrix $X$ that is supposed to contain two features $x$ and $y$. It creates the new features $x^2$, $y^2$ and $x\cdot y$ and returns a new feature matrix that also contains these additional features.


In [ ]:
def extend(X):
    n  = len(X)
    fx = np.reshape(X[:,0], (n, 1)) # extract first column
    fy = np.reshape(X[:,1], (n, 1)) # extract second column
    return np.hstack([fx, fy, fx*fx, fy*fy, fx*fy]) # stack everything horizontally

In [ ]:
X_train_quadratic = extend(X_train)
X_test_quadratic  = extend(X_test)

In [ ]:
M, score, accuracy = logistic_regression(X_train_quadratic, Y_train, X_test_quadratic, Y_test)
score, accuracy

This seems to work better. Let us compute the decision boundary and plot it.


In [ ]:
ϑ0                 = M.intercept_[0]
ϑ1, ϑ2, ϑ3, ϑ4, ϑ5 = M.coef_[0]

The decision boundary is now given by the following equation: $$ \vartheta_0 + \vartheta_1 \cdot x + \vartheta_2 \cdot y + \vartheta_3 \cdot x^2 + \vartheta_4 \cdot y^2 + \vartheta_5 \cdot x \cdot y = 0$$ This is the equation of a conic section, which for the coefficients computed above is an ellipse. Let us verify this claim and then plot the decision boundary together with the data.
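The quadratic part of this equation describes an ellipse precisely when the conic discriminant $\vartheta_5^2 - 4 \cdot \vartheta_3 \cdot \vartheta_4$ is negative (ignoring degenerate cases); a positive value would give a hyperbola. The following cell is a small additional sanity check of this condition for the fitted coefficients:


In [ ]:
# Conic discriminant check (added): negative means ellipse, positive means hyperbola.
ϑ5**2 - 4 * ϑ3 * ϑ4 < 0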


In [ ]:
a    = np.arange(-1.0, 1.0, 0.005)
b    = np.arange(-1.0, 1.0, 0.005)
A, B = np.meshgrid(a,b)
A

In [ ]:
B

In [ ]:
Z    = ϑ0 + ϑ1 * A + ϑ2 * B + ϑ3 * A * A + ϑ4 * B * B + ϑ5 * A * B 
Z

In [ ]:
plt.figure(figsize=(15, 10))
sns.set(style='darkgrid')
plt.title('A Classification Problem')
plt.axvline(x=0.0, c='k')
plt.axhline(y=0.0, c='k')
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.xticks(np.arange(-0.9, 1.1, step=0.1))
plt.yticks(np.arange(-0.8, 1.2, step=0.1))
plt.scatter(X_pass[:,0], X_pass[:,1], color='b') 
plt.scatter(X_fail[:,0], X_fail[:,1], color='r') 
CS = plt.contour(A, B, Z, levels=[0.0], colors='green')  # draw only the zero-level contour, i.e. the decision boundary

Let us try to add quartic features next. These are features like $x^4$, $x^2\cdot y^2$, etc. Luckily, SciKit-Learn has a function that can automate this process.


In [ ]:
from sklearn.preprocessing import PolynomialFeatures

In [ ]:
quartic = PolynomialFeatures(4, include_bias=False)
X_train_quartic = quartic.fit_transform(X_train)
X_test_quartic  = quartic.transform(X_test)  # reuse the transformer fitted on the training data
print(quartic.get_feature_names(['x', 'y']))

Let us fit the quartic model.


In [ ]:
M, score, accuracy = logistic_regression(X_train_quartic, Y_train, X_test_quartic, Y_test)
score, accuracy

The accuracy on the training set has increased, but we observe that the accuracy on the test set is actually not improving. Again, we proceed to plot the decision boundary.


In [ ]:
ϑ0 = M.intercept_[0]
ϑ1, ϑ2, ϑ3, ϑ4, ϑ5, ϑ6, ϑ7, ϑ8, ϑ9, ϑ10, ϑ11, ϑ12, ϑ13, ϑ14 = M.coef_[0]

Plotting the decision boundary starts to get tedious.


In [ ]:
a    = np.arange(-1.0, 1.0, 0.005)
b    = np.arange(-1.0, 1.0, 0.005)
A, B = np.meshgrid(a,b)
Z    = ϑ0 + ϑ1 * A + ϑ2 * B + \
       ϑ3 * A**2 + ϑ4 * A * B + ϑ5 * B**2 + \
       ϑ6 * A**3 + ϑ7 * A**2 * B + ϑ8 * A * B**2 + ϑ9 * B**3 + \
       ϑ10 * A**4 + ϑ11 * A**3 * B + ϑ12 * A**2 * B**2 + ϑ13 * A * B**3 + ϑ14 * B**4

In [ ]:
plt.figure(figsize=(15, 10))
sns.set(style='darkgrid')
plt.title('A Classification Problem')
plt.axvline(x=0.0, c='k')
plt.axhline(y=0.0, c='k')
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.xticks(np.arange(-0.9, 1.1, step=0.1))
plt.yticks(np.arange(-0.8, 1.2, step=0.1))
plt.scatter(X_pass[:,0], X_pass[:,1], color='b') 
plt.scatter(X_fail[:,0], X_fail[:,1], color='r') 
CS = plt.contour(A, B, Z, levels=[0.0], colors='green')

The decision boundary looks strange. Let's be bold and try to add features of an even higher degree.
However, in order to understand what is happening, we will only plot the training data.


In [ ]:
X_pass_train = X_train[Y_train == 1.0]
X_fail_train = X_train[Y_train == 0.0]

In order to automate the process, we define some auxiliary functions.

$\texttt{polynomial}(n)$ creates a polynomial in the variables A and B that contains all terms of the form $\Theta[k] \cdot A^i \cdot B^j$ where $i+j \leq n$.


In [ ]:
def polynomial(n):
    sum = 'Θ[0]'                  # start with the intercept term
    cnt = 0
    for k in range(1, n+1):       # k is the total degree of the terms added in this round
        for i in range(0, k+1):   # B gets the exponent i, A gets the exponent k-i
            cnt += 1
            sum += f' + Θ[{cnt}] * A**{k-i} * B**{i}'
    print('number of features:', cnt)
    return sum

Let's check this out for $n=4$.


In [ ]:
polynomial(4)

The function $\texttt{polynomial_grid}(n, M)$ takes a number $n$ and a model $M$. It returns the grid coordinates $A$ and $B$ together with the values of the degree-$n$ polynomial on that grid; these can be used to plot the decision boundary of the model.


In [ ]:
def polynomial_grid(n, M):
    Θ    = [M.intercept_[0]] + list(M.coef_[0])  # intercept followed by the coefficients of the model
    a    = np.arange(-1.0, 1.0, 0.005)
    b    = np.arange(-1.0, 1.0, 0.005)
    A, B = np.meshgrid(a, b)
    return A, B, eval(polynomial(n))             # evaluate the polynomial on the grid

The function $\texttt{plot_nth_degree_boundary}(n)$ creates a polynomial logistic regression model of degree $n$. It plots both the training data and the decision boundary.


In [ ]:
def plot_nth_degree_boundary(n, C=10000):
    poly         = PolynomialFeatures(n, include_bias=False)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly  = poly.transform(X_test)
    M, score, accuracy = logistic_regression(X_train_poly, Y_train, X_test_poly, Y_test, C)
    print('The accuracy on the training set is:', score)
    print('The accuracy on the test     set is:', accuracy)
    A, B, Z = polynomial_grid(n, M)
    plt.figure(figsize=(15, 10))
    sns.set(style='darkgrid')
    plt.title('A Classification Problem')
    plt.axvline(x=0.0, c='k')
    plt.axhline(y=0.0, c='k')
    plt.xlabel('x axis')
    plt.ylabel('y axis')
    plt.xticks(np.arange(-0.9, 1.11, step=0.1))
    plt.yticks(np.arange(-0.8, 1.21, step=0.1))
    plt.scatter(X_pass_train[:,0], X_pass_train[:,1], color='b') 
    plt.scatter(X_fail_train[:,0], X_fail_train[:,1], color='r') 
    CS = plt.contour(A, B, Z, levels=[0.0], colors='green')

Let us test this for the polynomial logistic regression model of degree $4$.


In [ ]:
plot_nth_degree_boundary(4)

This seems to be the same shape that we have seen earlier. It looks like the function $\texttt{plot_nth_degree_boundary}(n)$ is working. Let's try polynomials of higher degree.


In [ ]:
plot_nth_degree_boundary(5)

The score on the training set has improved. What happens if we try still higher degrees?


In [ ]:
plot_nth_degree_boundary(6)

We have captured one more of the training examples. Let's be bold: we want $100\%$ training accuracy.


In [ ]:
plot_nth_degree_boundary(14)

The model is getting more complicated, but it is not getting better, as the accuracy on the test set has not improved.
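To make the overfitting visible at a glance, the following cell compares training and test accuracy for several polynomial degrees. It is a small additional sketch that only reuses the functions defined above:


In [ ]:
# Compare training and test accuracy for several polynomial degrees (added sketch).
for n in [2, 4, 6, 8, 10, 14]:
    poly     = PolynomialFeatures(n, include_bias=False)
    Xtr_poly = poly.fit_transform(X_train)
    Xte_poly = poly.transform(X_test)
    _, train_acc, test_acc = logistic_regression(Xtr_poly, Y_train, Xte_poly, Y_test)
    print(f'degree {n:2d}: training accuracy = {train_acc:.3f}, test accuracy = {test_acc:.3f}')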


In [ ]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
X_pass_train = X_train[Y_train == 1.0]
X_fail_train = X_train[Y_train == 0.0]

Let us check whether regularization can help. In SciKit-Learn the parameter $C$ is the inverse of the regularization strength, so smaller values of $C$ regularize more strongly; this prevents the decision boundary from becoming too wiggly, and thus the accuracy on the test set can increase. The function below plots all of the data; a small sweep over different values of $C$ is sketched right after it.


In [ ]:
def plot_nth_degree_boundary_all(n, C):
    poly         = PolynomialFeatures(n, include_bias=False)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly  = poly.transform(X_test)
    M, score, accuracy = logistic_regression(X_train_poly, Y_train, X_test_poly, Y_test, C)
    print('The accuracy on the training set is:', score)
    print('The accuracy on the test     set is:', accuracy)
    A, B, Z = polynomial_grid(n, M)
    plt.figure(figsize=(15, 10))
    sns.set(style='darkgrid')
    plt.title('A Classification Problem')
    plt.axvline(x=0.0, c='k')
    plt.axhline(y=0.0, c='k')
    plt.xlabel('x axis')
    plt.ylabel('y axis')
    plt.xticks(np.arange(-0.9, 1.11, step=0.1))
    plt.yticks(np.arange(-0.8, 1.21, step=0.1))
    plt.scatter(X_pass[:,0], X_pass[:,1], color='b') 
    plt.scatter(X_fail[:,0], X_fail[:,1], color='r') 
    CS = plt.contour(A, B, Z, levels=[0.0], colors='green')
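
Before looking at the plots, the following cell sweeps over a few values of the regularization parameter $C$ for the degree-$14$ model. It is a small additional sketch that only reuses the helper functions defined above:


In [ ]:
# Sweep over the inverse regularization strength C for the degree-14 model (added sketch).
# Smaller values of C regularize more strongly.
poly         = PolynomialFeatures(14, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly  = poly.transform(X_test)
for C in [0.1, 1.0, 10.0, 100.0, 10000.0]:
    _, train_acc, test_acc = logistic_regression(X_train_poly, Y_train, X_test_poly, Y_test, C)
    print(f'C = {C:8.1f}: training accuracy = {train_acc:.3f}, test accuracy = {test_acc:.3f}')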

In [ ]:
plot_nth_degree_boundary_all(14, 100.0)

In [ ]:
plot_nth_degree_boundary_all(20, 100000.0)
